
    A Transport-Layer Network for Distributed FPGA Platforms

    Abstract: We present a transport-layer network that helps developers build safe, high-performance distributed FPGA applications. Two essential features of such a network are virtual channels and end-to-end flow control. Our network implements both, exploiting the low error rate characteristic of a rack-level FPGA network to implement low-overhead, credit-based end-to-end flow control. The design exposes many parameters in the source code that can be set at FPGA synthesis time, providing the flexibility to size buffers and flow-control credits so as to make the best use of scarce on-chip memory and to match the traffic pattern of each virtual channel. Our prototype cluster, composed of 20 Xilinx VC707 boards, each with four 20 Gb/s serial links, achieves an effective bandwidth of 85% of the maximum physical bandwidth and a latency of 0.5 µs per hop. User feedback suggests that these features make distributed application development significantly easier.
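
A minimal sketch of how credit-based end-to-end flow control of this kind typically works; the class names, CREDITS_PER_VC, and the retry-free backlog handling are illustrative stand-ins, not details from the paper:

```python
# Illustrative credit-based end-to-end flow control on one virtual channel.
# CREDITS_PER_VC stands in for the synthesis-time parameters the abstract
# mentions (receiver buffer depth == credits advertised to the sender).
from collections import deque

CREDITS_PER_VC = 8  # one credit == one guaranteed buffer slot at the receiver


class Sender:
    def __init__(self):
        self.credits = CREDITS_PER_VC  # advertised up front by the receiver
        self.backlog = deque()         # flits waiting for credits to return

    def send(self, receiver, flit):
        if self.credits == 0:          # no room at the far end: hold locally
            self.backlog.append(flit)
            return False
        self.credits -= 1
        receiver.accept(self, flit)
        return True


class Receiver:
    def __init__(self):
        self.buffer = deque()

    def accept(self, sender, flit):
        self.buffer.append(flit)       # guaranteed to fit: sender spent a credit

    def drain(self, sender):
        flit = self.buffer.popleft()   # consuming a flit frees a slot...
        sender.credits += 1            # ...so its credit goes back to the sender
        return flit


if __name__ == "__main__":
    tx, rx = Sender(), Receiver()
    results = [tx.send(rx, f"flit-{i}") for i in range(10)]
    print(results.count(True), "accepted,", results.count(False), "held back")  # 8 / 2
    rx.drain(tx)
    print("credits after one drain:", tx.credits)  # 1
```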

    BlueDBM: An Appliance for Big Data Analytics

    Complex data queries, because of their need for random accesses, have proven to be slow unless all the data can be accommodated in DRAM. In many domains, such as genomics, geological data, and daily Twitter feeds, the datasets of interest are 5 TB to 20 TB. To accommodate such a dataset entirely in DRAM, one would need a cluster of roughly 100 servers, each with 128 GB to 256 GB of DRAM. On the other hand, such datasets could easily be stored in the flash memory of a rack-sized cluster. Flash storage has much better random-access performance than hard disks, which makes it desirable for analytics workloads. In this paper we present BlueDBM, a new system architecture that combines flash-based storage with in-store processing capability and a low-latency, high-throughput inter-controller network. We show that BlueDBM outperforms a flash-based system without these features by a factor of 10 for some important applications. While the performance of a RAM-cloud system falls sharply when even 5-10% of the references go to secondary storage, this sharp degradation is not an issue in BlueDBM. BlueDBM presents an attractive point in the cost-performance trade-off for Big Data analytics.

    Funding: Quanta Computer (Firm); Samsung (Firm); Lincoln Laboratory (PO7000261350); Intel Corporation.
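
The server-count estimate in the abstract is straightforward capacity arithmetic; a back-of-envelope version, using only the dataset and per-server DRAM figures quoted above:

```python
# Back-of-envelope check of the "hold it all in DRAM" sizing argument.
import math


def servers_needed(dataset_tb, dram_per_server_gb):
    """Servers whose aggregate DRAM can hold the whole dataset."""
    return math.ceil(dataset_tb * 1024 / dram_per_server_gb)


for dataset_tb in (5, 20):
    for dram_gb in (128, 256):
        n = servers_needed(dataset_tb, dram_gb)
        print(f"{dataset_tb:>2} TB dataset, {dram_gb} GB/server -> {n:>3} servers")

# 5 TB needs 20-40 servers and 20 TB needs 80-160, so a ~100-server cluster
# with 128-256 GB each (12.8-25.6 TB aggregate) is the ballpark the abstract
# cites for keeping the whole dataset resident in DRAM.
```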

    Scalable distributed flash-based key-value store

    Thesis: S.M. in Computer Science and Engineering, Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2016. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Cataloged from the student-submitted PDF version of the thesis. Includes bibliographical references (pages 63-68).

    Low-latency and high-bandwidth access to a large amount of data is a key requirement for many web applications in data centers. To satisfy this requirement, distributed in-memory key-value stores (KVS), such as memcached and Redis, are widely used as a caching layer to augment the slower persistent backend storage (e.g., disks) in data centers. A DRAM-based KVS provides fast key-value access, but it is difficult to scale the memory pool further because of cost, power/thermal concerns, and floor-plan limits. Flash memory offers an alternative KVS storage medium with higher capacity per dollar and lower power per byte. However, flash-based KVS software running on an x86 server with commodity SSDs cannot harness the full potential performance of flash memory, because of the overheads of the legacy storage I/O stack and a network that is slow relative to the flash storage. In this work, we examine the architecture of a scalable distributed flash-based key-value store, BlueCache, that overcomes these limitations. BlueCache consists of low-power hardware accelerators that directly manage raw NAND flash chips and also provide near-storage network processing. We have constructed a BlueCache KVS cluster that achieves the full potential performance of the flash chips and whose throughput scales directly with the number of nodes. BlueCache is 3.8x faster and consumes 25x less power than flash-backed KVS software running on x86 servers. As a data-center caching solution, BlueCache becomes the superior choice when the DRAM-based KVS has more than 7.7% misses due to limited capacity. BlueCache presents an attractive point in the cost-performance trade-off for data-center-scale key-value systems.

    by Shuotao Xu. S.M. in Computer Science and Engineering.
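
The 7.7% break-even claim is the familiar effective-latency argument for a cache in front of a slower backend; a sketch of that argument with placeholder latency numbers (the thesis derives its figure from measured values, not these):

```python
# Effective-latency model behind a "flash KVS beats DRAM cache above X% misses"
# claim. All latency constants are illustrative placeholders, not measurements
# from the thesis; only the shape of the argument is taken from the abstract.
def effective_latency_us(miss_rate, hit_us, miss_penalty_us):
    """Average latency of a DRAM cache that misses to a slow backend."""
    return (1 - miss_rate) * hit_us + miss_rate * miss_penalty_us


def break_even_miss_rate(hit_us, miss_penalty_us, flash_kvs_us):
    """Miss rate at which the DRAM cache's average latency equals a flat
    flash-KVS latency."""
    return (flash_kvs_us - hit_us) / (miss_penalty_us - hit_us)


DRAM_HIT_US = 100         # DRAM KVS hit, over the network (placeholder)
BACKEND_MISS_US = 10_000  # miss served by the disk-backed store (placeholder)
FLASH_KVS_US = 800        # direct flash-KVS access (placeholder)

m = break_even_miss_rate(DRAM_HIT_US, BACKEND_MISS_US, FLASH_KVS_US)
print(f"break-even miss rate ~ {m:.1%}")                               # ~7.1%
print(effective_latency_us(0.10, DRAM_HIT_US, BACKEND_MISS_US), "us")  # 1090.0 us
```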

    Computing Big-Data Applications Near Flash

    Current systems produce a large and growing amount of data, often referred to as Big Data. Providing valuable insights from this data requires new computing systems to store and process it efficiently. For fast response times, Big Data processing typically relies on in-memory computing, which requires a cluster of machines with enough aggregate DRAM to accommodate the entire dataset for the duration of the computation. Big Data typically exceeds several terabytes, so this approach can incur significant overhead in power, space, and equipment. If the amount of DRAM is not sufficient to hold the working set of a query, performance deteriorates catastrophically. Although NAND flash can provide high-bandwidth data access and has higher capacity density and lower cost per bit than DRAM, flash storage has dramatically different characteristics than DRAM, such as large access granularity and longer access latency. There are therefore many challenges in enabling flash-centric computing for Big-Data applications with performance comparable to that of in-memory computing. This thesis presents flash-centric hardware architectures that provide high processing throughput for data-intensive applications while hiding the long flash access latency. Specifically, we describe two novel flash-centric hardware accelerators, BlueCache and AQUOMAN, which lower the cost of two common data-center workloads: key-value caching and SQL analytics. We have built BlueCache and AQUOMAN using FPGAs and flash storage, and show that they provide competitive performance on Big-Data applications with multi-terabyte datasets. BlueCache provides a key-value cache that is 10-100X cheaper than a DRAM-based solution and can outperform a DRAM-based system when the latter has more than 7.4% misses on read-intensive workloads. A desktop-class machine with a single 1 TB AQUOMAN disk can achieve performance similar to that of a dual-socket general-purpose server with off-the-shelf SSDs. We believe BlueCache and AQUOMAN can dramatically bring down the cost of acquiring and operating high-performance computing systems for data-center-scale Big-Data applications.

    Ph.D.
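
The thesis's point about hiding long flash latency with high throughput is, at bottom, a Little's-law observation about keeping enough requests in flight; a rough illustration with assumed device numbers (the bandwidth, latency, and page size below are not taken from the thesis):

```python
# Little's-law style estimate: outstanding reads = arrival rate x latency.
# Device numbers are assumed for illustration only.
def reads_in_flight(bandwidth_gb_per_s, latency_us, page_kb):
    """Outstanding page reads needed to keep a flash array at full bandwidth."""
    bytes_per_us = bandwidth_gb_per_s * 1e9 / 1e6   # bytes per microsecond
    pages_per_us = bytes_per_us / (page_kb * 1024)  # pages completed per microsecond
    return pages_per_us * latency_us                # rate x latency (Little's law)


print(round(reads_in_flight(bandwidth_gb_per_s=2.0, latency_us=100, page_kb=8)))
# ~24: even a modest 2 GB/s flash array with ~100 us reads needs dozens of
# 8 KB requests in flight, which is why deeply pipelined hardware that streams
# data past the computation is a natural fit.
```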

    FCUDA Memory Subsystems

    FCUDA is an innovative design flow that transforms high-level parallel code for GPUs into FPGA configurations for higher performance and lower power consumption. FCUDA is a joint project involving multiple research groups, including the ECE department of the University of Illinois at Urbana-Champaign (Deming Chen’s and Wen-Mei Hwu’s groups), the CS department of the University of California at Los Angeles (Jason Cong’s group), and ADSC (Advanced Digital Science Center) in Singapore. The research task was to design customizable, lightweight, high-bandwidth memory communication subsystems that connect multiple custom cores instantiated on the FPGA device to off-chip DDR2 memory. Achieving high bandwidth on FPGA devices is especially challenging because they have fewer memory channels than GPUs and CPUs. Meanwhile, the standard bus systems offered by FPGA vendors carry high delay and area overhead due to their support for reconfigurability, so they cannot deliver high-speed data transfers. The first part of the work modifies the existing modules of the DDR2 memory controller generated by the Xilinx ISE tool to interface with FCUDA-generated cores. To overcome the intrinsically fixed burst length of a standard memory controller, a module that issues a sequential burst of bursts to support high-performance, parameterizable memory transfers is then designed. Special attention is paid to data alignment when large amounts of data are bursted.

    Unpublished; not peer reviewed; U of I Only; undergraduate senior thesis not recommended for open access.
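
A software analogue of the "burst of bursts" idea described above: split an arbitrary-length, possibly misaligned transfer into the fixed-length, aligned bursts the controller actually supports. The burst size and function names are illustrative; the real module operates on the DDR2 controller's native burst length:

```python
# Split a parameterizable transfer into fixed-length, aligned bursts.
# BURST_BYTES is an assumed value (e.g. burst length 8 on a 64-bit data bus).
BURST_BYTES = 64


def burst_schedule(start_addr, length):
    """Return (burst_base, offset_in_burst, bytes_used) tuples covering
    the byte range [start_addr, start_addr + length)."""
    bursts = []
    addr, remaining = start_addr, length
    while remaining > 0:
        base = addr - (addr % BURST_BYTES)           # align down to a burst boundary
        offset = addr - base
        take = min(BURST_BYTES - offset, remaining)  # usable bytes in this burst
        bursts.append((base, offset, take))
        addr += take
        remaining -= take
    return bursts


# A 200-byte transfer starting 24 bytes into a burst needs four bursts:
# a partial head, two full bursts, and a partial tail.
for base, offset, take in burst_schedule(start_addr=1048, length=200):
    print(f"burst @ {base:5d}  offset {offset:2d}  use {take:2d} bytes")
```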

    AQUOMAN: An Analytic-Query Offloading Machine

    Analytic workloads on terabyte datasets are often run in the cloud, where application and storage servers are separate and connected via a network. To saturate the storage bandwidth and hide the long storage latency, such a solution requires an expensive server cluster with sufficient aggregate DRAM capacity and hardware threads. An alternative is to push the query computation into the storage servers. In this paper we present an in-storage Analytics QUery Offloading MAchiNe (AQUOMAN) that offloads most SQL operators, including multi-way joins, to SSDs. AQUOMAN executes Table Tasks, which apply a static dataflow graph of SQL operators to relational tables to produce an output table. Table Tasks use a streaming computation model, which allows AQUOMAN to process queries with a reasonable amount of DRAM for intermediate results. AQUOMAN is a general analytic query processor that can be integrated into the database software stack transparently. We have built a prototype of AQUOMAN in FPGAs and, using TPC-H benchmarks on 1 TB datasets, shown that a single 1 TB AQUOMAN disk can, on average, free up 70% of CPU cycles and reduce DRAM usage by 60%. One way to visualize this saving: if we run queries sequentially and ignore inter-query page-cache reuse, MonetDB running on a 4-core, 16 GB-DRAM machine with AQUOMAN-augmented SSDs performs, on average, as well as MonetDB running on a 32-core, 128 GB-DRAM machine with standard SSDs.

    © 2020 IEEE Computer Society. All rights reserved.
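
A toy software rendition of the Table Task idea: a fixed pipeline of SQL operators applied to a table streamed row by row, so intermediate results never have to be materialized in full. The schema, query, and operator set below are invented for illustration and are not AQUOMAN's actual operator library:

```python
# Streaming rendition of a static SQL-operator dataflow ("Table Task" style):
# rows flow through a fixed pipeline of generators, so only one row plus the
# aggregate state is resident at any time. Schema and query are made up.
def scan(table):
    yield from table                               # storage feeds rows one by one


def select(rows, predicate):
    return (r for r in rows if predicate(r))       # streaming filter


def project(rows, columns):
    return ({c: r[c] for c in columns} for r in rows)


def sum_by(rows, key, value):
    acc = {}                                       # the only materialized state
    for r in rows:
        acc[r[key]] = acc.get(r[key], 0) + r[value]
    return acc


lineitem = [
    {"orderkey": 1, "quantity": 17, "price": 100.0},
    {"orderkey": 1, "quantity": 36, "price": 150.0},
    {"orderkey": 2, "quantity": 8,  "price": 40.0},
]

result = sum_by(
    project(select(scan(lineitem), lambda r: r["quantity"] > 10),
            ["orderkey", "price"]),
    key="orderkey", value="price")
print(result)  # {1: 250.0}
```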